2 Python essentials - files and pandas
Now that you have reviewed the basic functionality, let’s look at how to read and write files, and how to use some of the most common packages (e.g., pandas, matplotlib, …).
Reading and writing files
Reading and writing files is an important aspect of programming. Let’s imagine you have the following list of molecules:
molecules = ['Amigdalin', 'Fenfuram', 'Estradiol', '2-Methylbutanol']
and you would like to save it in a text file, one name per line.
To do so, you could run the following code:
with open('molecules.txt', 'w') as file:
    file.write('\n'.join(molecules))
Here, 'molecules.txt' is the name of the file you want to write to. The 'w' argument is the mode: it opens the file for writing. If the file does not exist, it will be created. If the file exists, its contents will be overwritten.
The write() method is used to write to the file. You pass a string to the write() method, and it writes that string to the file.
The with statement is used to make sure that the file is closed properly after you are done writing to it.
The string that is written to the file is '\n'.join(molecules), where '\n' is a line break and the .join(molecules) call joins all the elements in the molecules list with '\n'. We can print it below:
print('\n'.join(molecules))
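As a small complement, the sketch below contrasts the 'w' mode with the append mode 'a', which adds to the end of an existing file instead of overwriting it. It writes to a temporary directory so that no existing file is touched, and the extra molecule name is only an illustrative assumption.

```python
import tempfile
from pathlib import Path

molecules = ['Amigdalin', 'Fenfuram', 'Estradiol', '2-Methylbutanol']

# Work in a temporary directory so no existing file is touched
tmp_dir = Path(tempfile.mkdtemp())
path = tmp_dir / 'molecules.txt'

# 'w' creates the file (or overwrites it) and writes one name per line
with open(path, 'w') as file:
    file.write('\n'.join(molecules))

# 'a' appends to the end of the file instead of overwriting it
with open(path, 'a') as file:
    file.write('\n' + 'Caffeine')  # illustrative extra entry

content = path.read_text()
n_lines = len(content.splitlines())
print(content)
```

Opening the same file with 'w' a second time would instead discard the four original names.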
To read the molecules.txt file that we created and get back the list of molecules, we use a similar syntax.
with open('molecules.txt', 'r') as file:
    molecules = [molecule for molecule in file.readlines()]
print(molecules)
Here, 'molecules.txt' is the name of the file you want to read. The 'r' argument is the mode: it opens the file for reading (as opposed to writing to it).
The open() function returns a file object that you can use to read the file. The readlines() method is used to read the contents of the file, line by line. The lines are returned as a list of strings.
However, as you can see, each string still contains the line break character \n at the end.
Hence, we apply the .strip() method to remove whitespace at the start and at the end of each string.
with open('molecules.txt', 'r') as file:
    molecules = [molecule.strip() for molecule in file.readlines()]
print(molecules)
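There are other common ways to get the same list. The sketch below is one possible variation, not the course's method: it iterates over the file object directly, and it also uses read() followed by splitlines(), which drops the line breaks without a separate strip() step. The file is created in a temporary directory for the purpose of the example.

```python
import tempfile
from pathlib import Path

molecules = ['Amigdalin', 'Fenfuram', 'Estradiol', '2-Methylbutanol']
path = Path(tempfile.mkdtemp()) / 'molecules.txt'
path.write_text('\n'.join(molecules))

# A file object can be iterated over directly, one line at a time
with open(path, 'r') as file:
    from_loop = [line.strip() for line in file]

# read() returns the whole file as one string; splitlines() drops the '\n'
with open(path, 'r') as file:
    from_split = file.read().splitlines()

print(from_loop == molecules, from_split == molecules)
```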
Pandas
Pandas is a popular library for data analysis and manipulation in Python. It provides a way to store and manipulate data in a tabular form, similar to a spreadsheet.
Here’s how to use pandas for reading and writing files:
import pandas as pd
# Reading a CSV file
df = pd.read_csv('file.csv')
print(df)
# Reading an Excel file
df = pd.read_excel('file.xlsx')
print(df)
# Writing a CSV file
df.to_csv('file.csv', index=False)
# Writing an Excel file
df.to_excel('file.xlsx', index=False)
Here, 'file.csv' and 'file.xlsx' are the names of the files you want to read or write. You will need to replace them with the actual names of the files you are working with.
The pd.read_csv() function is used to read a CSV file, and the pd.read_excel() function is used to read an Excel file. The functions return a pandas DataFrame (this is what df stands for), which is a two-dimensional labeled data structure with columns of potentially different types.
The to_csv() and to_excel() methods are used to write a DataFrame to a CSV file and an Excel file, respectively.
The index argument is used to specify whether or not to write the index of the DataFrame to the file. If index is set to False, the index will not be written.
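To make the effect of the index argument concrete, here is a minimal sketch on a small made-up table (the column names and values are assumptions for illustration). Calling to_csv() without a file name returns the CSV text as a string, which makes the difference easy to inspect.

```python
import io
import pandas as pd

# Small made-up table, used only for illustration
df = pd.DataFrame({'name': ['Fenfuram', 'Estradiol'], 'rings': [2, 4]})

with_index = df.to_csv()                # default: index=True
without_index = df.to_csv(index=False)

print(with_index)      # has an extra, unnamed first column holding the index
print(without_index)   # contains only the 'name' and 'rings' columns

# Reading the index-free text back gives the same two columns again
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```

If a file written with the default index=True is read back with read_csv, the index typically shows up as an extra column, which is why index=False is used above.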
Because pandas is not built into Python, we have to import the module before using it. This is the reason for the first line, import pandas as pd.
If you want to go more in-depth, there is also a great pandas tutorial on Kaggle Learn.
ESOL dataset
Let’s download a dataset containing molecules. We will use the Estimated SOLubility (ESOL) dataset by Delaney, as preprocessed by DeepChem.
The Delaney dataset is a collection of small organic molecules with experimental solubility values in water. It is often used as a benchmark dataset for testing and evaluating the performance of machine learning models in predicting the solubility of molecules. The dataset contains 1,128 molecules and a variety of molecular properties, including the solubility value, the molecular weight, and structural descriptors such as the number of rings and rotatable bonds. We will download a .csv file.
!wget "https://raw.githubusercontent.com/schwallergroup/ai4chem_course/main/notebooks/01%20-%20Basics/data/delaney-processed.csv"
The delaney-processed.csv file was successfully downloaded.
Here, we used the wget command with a ! in front. In Jupyter notebooks, the ! symbol is used to run shell/terminal commands directly from the notebook.
You can use it to check the version of Python that you’re using:
!python --version
This feature is useful when you want to run shell commands directly from the notebook, without having to switch to a terminal or command prompt.
It’s important to note that the ! symbol only works in Jupyter notebooks, and not in regular Python scripts.
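In a regular Python script, one common way to run an external command is the standard-library subprocess module. The following is a minimal sketch of the equivalent of !python --version; using sys.executable to refer to the current interpreter is a choice made for this example.

```python
import subprocess
import sys

# Run "<current python> --version" as an external command and capture its output
result = subprocess.run(
    [sys.executable, '--version'],
    capture_output=True,
    text=True,
    check=True,  # raise an error if the command fails
)
version_text = (result.stdout or result.stderr).strip()
print(version_text)
```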
Exercise 01b_01
To practice this, let’s use some more useful shell commands:
- ls - list the files in the folder
- head file - show the first lines of the file
- mv file target_location - move the file to the target location
- mkdir folder_name - make a new folder
Use those commands to:
1. See if delaney-processed.csv is in the current folder (ls)
2. Make a new folder called data (mkdir)
3. Move the delaney-processed.csv file to the data folder (mv)
# write your code here, don't forget the `!` for shell commands
# if you want to see the solution, uncomment the following line
# %load https://raw.githubusercontent.com/schwallergroup/ai4chem_course/main/notebooks/solutions/solution_01b_01.txt
Once you have correctly moved delaney-processed.csv to the data folder, you should be able to look at the first lines of the file with the following command.
!head data/delaney-processed.csv
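The same kind of file handling can also be done from plain Python, which is useful outside of notebooks. The sketch below mirrors ls, mkdir, mv and head with the standard library; it works on a small placeholder file in a temporary folder (the file name example.csv and its contents are hypothetical) rather than on the downloaded dataset.

```python
import shutil
import tempfile
from pathlib import Path

# Temporary working folder with a small placeholder file (hypothetical name)
work_dir = Path(tempfile.mkdtemp())
csv_path = work_dir / 'example.csv'
csv_path.write_text('a,b\n1,2\n3,4\n')

# Like `ls`: list the entries in the folder
print(sorted(p.name for p in work_dir.iterdir()))

# Like `mkdir data`: create a folder (no error if it already exists)
data_dir = work_dir / 'data'
data_dir.mkdir(exist_ok=True)

# Like `mv example.csv data/`: move the file into the folder
moved_path = Path(shutil.move(str(csv_path), str(data_dir)))

# Like `head`: show the first lines of the file (here, the first two)
with open(moved_path) as file:
    first_lines = [line.rstrip('\n') for _, line in zip(range(2), file)]
print(first_lines)
```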
Handling the ESOL dataset with pandas
Exercise 01b_02
You see above that the .csv file contains comma-separated values. So, let’s use pandas to:
- read the file into a DataFrame (use the read_csv function and assign it to the df variable)
- show the first 5 rows using df.head()
# Let's assume, we have not yet imported pandas
# Start by importing the pandas module
# Read the ESOL dataset into a DataFrame
# Inspect the first 5 rows of the DataFrame
# For the solution, uncomment the following line:
# %load https://raw.githubusercontent.com/schwallergroup/ai4chem_course/main/notebooks/solutions/solution_01b_02.py
# On Google Colab, you can run this command to make the DataFrame interactive.
%load_ext google.colab.data_table
df
Exercise 01b_03
Let’s focus on a single column for the moment.
You can access a specific column of the DataFrame by using square bracket notation and the name of the column, such as df["measured log solubility in mols per litre"].
You can also perform operations on a specific column of the DataFrame, such as calculating the mean value, by using methods such as mean().
Try this in the code cell below, and save the mean solubility in the variable mean_solubility.
# Access the "measured log solubility in mols per litre" column
# Calculate the mean of that column
# For the solution, uncomment the following line:
# %load https://raw.githubusercontent.com/schwallergroup/ai4chem_course/main/notebooks/solutions/solution_01b_03.py
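To illustrate the idea without touching the exercise data, here is a minimal sketch on a small made-up DataFrame. The column names and values are assumptions for illustration and are not taken from the ESOL dataset.

```python
import pandas as pd

# Made-up values for illustration only (not taken from the ESOL dataset)
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'log_solubility': [-1.0, -2.0, -3.0, -4.0],
})

column = toy['log_solubility']   # square brackets return a pandas Series
column_type = type(column).__name__
print(column_type)

# A Series offers summary methods such as mean(), min() and max()
toy_mean = column.mean()
print(toy_mean, column.min(), column.max())
```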
Once you have this, you could try the following example:
# Selecting rows based on conditions
high_solubility = df[df["measured log solubility in mols per litre"] > mean_solubility]
print(high_solubility)
# Adding a new column to the DataFrame
df["Solubility Class"] = "Low"
df.loc[df["measured log solubility in mols per litre"] > mean_solubility, "Solubility Class"] = "High"
print(df.head())
# Grouping data by a column
grouped = df.groupby("Solubility Class")
# numeric_only=True skips text columns, which cannot be averaged
print(grouped.mean(numeric_only=True))
# Sorting the DataFrame
df.sort_values("measured log solubility in mols per litre", ascending=False, inplace=True)
print(df.head())
In this example, you can see how to select rows based on conditions using boolean indexing, add a new column to the DataFrame, group data by a column, and sort the DataFrame.
The df[df["measured log solubility in mols per litre"] > mean_solubility] line selects rows from the DataFrame where the measured log solubility is greater than the mean solubility.
The df["Solubility Class"] = "Low" line adds a new column to the DataFrame, and the df.loc[df["measured log solubility in mols per litre"] > mean_solubility, "Solubility Class"] = "High" line sets the value to "High" in the rows where the measured log solubility is above the mean.
The grouped = df.groupby("Solubility Class") line groups the data by the Solubility Class column, and the mean() call calculates the mean value of the numeric columns in each group.
The df.sort_values("measured log solubility in mols per litre", ascending=False, inplace=True) line sorts the DataFrame in descending order of measured log solubility.
That’s a brief overview of some of the basic operations you can perform on the ESOL dataset using pandas.
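The same operations can be tried on a tiny made-up table, where the results are easy to check by hand. The sketch below is one possible variation rather than the course's code: it builds the class column in a single step with numpy.where, and it uses sort_values without inplace=True, which returns a new DataFrame. All column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'log_solubility': [-1.0, -2.0, -3.0, -4.0],
})
toy_mean = toy['log_solubility'].mean()  # -2.5

# One-step alternative to assigning "Low" and then overwriting with "High"
toy['Solubility Class'] = np.where(toy['log_solubility'] > toy_mean, 'High', 'Low')

# Count rows per class and average only the numeric column of each group
class_counts = toy['Solubility Class'].value_counts()
class_means = toy.groupby('Solubility Class')['log_solubility'].mean()

# sort_values without inplace=True returns a new, sorted DataFrame
toy_sorted = toy.sort_values('log_solubility', ascending=False)

print(class_counts)
print(class_means)
print(toy_sorted)
```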
Additional pandas functionality
You can do many more things with pandas, as described in the documentation. Most of the questions that you will have have probably already been answered by someone on StackOverflow, and you could use tools like ChatGPT as a personal interactive tutor.
Here are some additional examples of how you could merge or concatenate DataFrames:
# Merging DataFrames
df1 = df[["Compound ID", "measured log solubility in mols per litre"]]
df2 = df[["Compound ID", "smiles"]]
merged = pd.merge(df1, df2, on="Compound ID")
print(merged.head())
# Concatenating DataFrames
df3 = df[["Compound ID", "measured log solubility in mols per litre"]].head(10)
df4 = df[["Compound ID", "smiles"]].tail(10)
concatenated = pd.concat([df3, df4])
print(concatenated)
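The difference between merging and concatenating is easier to see on two tiny made-up tables. In this sketch (column names and values are assumptions for illustration), merge matches rows by a shared key, while concat simply stacks rows and fills columns that are missing from one table with NaN.

```python
import pandas as pd

left = pd.DataFrame({'id': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'id': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# merge matches rows on a key; the default how="inner" keeps only shared keys
inner = pd.merge(left, right, on='id')
# how="outer" keeps all keys and fills the gaps with NaN
outer = pd.merge(left, right, on='id', how='outer')

# concat stacks rows; columns missing from one frame are filled with NaN
stacked = pd.concat([left, right], ignore_index=True)

print(inner)
print(outer)
print(stacked)
```

This also explains why concatenating two DataFrames with different columns, as in the ESOL example, produces NaN entries in the result.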
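You can also handle missing values in your DataFrame (there are no missing values in the ESOL dataset).
# Handling missing values
# Assign the result back to the column instead of using inplace=True on a selected column
df["Compound ID"] = df["Compound ID"].fillna("")
print(df.head())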
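Since the ESOL dataset has no gaps, the following minimal sketch uses a small made-up table with missing entries to show three common steps: counting missing values, filling them, and dropping incomplete rows. The column names, values and fill defaults are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'name': ['A', None, 'C'],
    'value': [1.0, np.nan, 3.0],
})

# Count the missing entries in each column
missing_per_column = toy.isna().sum()

# Fill gaps with a column-specific default (returns a new DataFrame)
filled = toy.fillna({'name': '', 'value': toy['value'].mean()})

# Or drop every row that contains at least one missing value
dropped = toy.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```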